ABSTRACT


Pinterest Image Search Engine helps millions of users discover
interesting content every day. This motivates us to improve the image
search quality by evolving our ranking techniques. In this work, 
we share how we practically design and deploy various ranking 
pipelines into Pinterest image search ecosystem. Specifically, we 
focus on introducing our novel research and study on three as- 
pects: training data, user/image featurization and ranking models. 
Extensive offline and online studies compared the performance of 
different models and demonstrated the efficiency and effectiveness 
of our final launched ranking models. 


1 INTRODUCTION 


Learning to rank [3-6, 8, 12, 17, 21, 34, 37] has been actively
studied over the past decades to improve both the relevance of search
results and searchers' engagement. With the advances in learning to
rank technologies, one might be tempted to think that building a
ranking component for an image search engine is straightforward. This
is true if we simply want a workable solution: in the early days of
Pinterest Image Search, we built our first search system on top of
Apache Lucene and Solr [26, 32] (open-source information retrieval
systems), and the results were simply ranked by the text relevance
scores between queries and the text descriptions of images.

However, in Pinterest image search, the items users search for are
Pins, each of which contains an image, a hyperlink and descriptions,
rather than web pages or online services. In addition, different user
engagement mechanisms make the Pinterest search process differ from
general web search engines. We have therefore evolved our search
ranking over the past few years by adding various advancements that
address the unique challenges in Pinterest Image Search.

The first challenge arises from an important question: why do users
search for images in Pinterest? As shown in Figure 1, Pinterest users
(Pinners) can perform in total 60 actions on the search result Pins,
such as “repin”, “click-through”, “close up”, “try it”, etc. In
addition, users have different intents while searching in Pinterest
[23]: some users prefer to browse the pins to get inspiration, while
female users prefer to shop the look in Pinterest or search for
recipes to cook. On one hand, flexible engagement options help us to
understand how users search for images and leverage those
(a) Pinterest users can perform various actions towards the result Pins of the query “valentines day nails”.
(b) Close up: clicking one pin leads to a zoom-in page. A further click on the “save” button is called “Repin”.
(c) The second click on the close-up page in (b) goes to the external website and is named “click” in Pinterest.


Figure 1: Pinterest Image Search UI on Mobile Apps. 


signals to provide a better ranking of search results; on the other
hand, the heterogeneity of engagement actions poses an additional
challenge: how should we incorporate this explicit feedback? In a
traditional search engine, a clicked result can be explicitly weighted
as more important than a non-clicked one, while in the Pinterest
ecosystem it is very difficult to define a universal preference rule:
is a “try it” pin preferable to a “close up” pin, or vice versa?

Another challenge lies in the nature of image items. Compared 
to the traditional documents or web pages, the text description of 
the image is much shorter and noisier. Meanwhile, although we 
understand that “A picture is worth a thousand words", it is very 
difficult to extract reliable visual signals from the image. 

Finally, much literature has been published on advanced learn- 
ing to rank algorithms (see related work section) and their real-life 
applications in industry. Unfortunately, the best ranking algorithm 
to use for a given application domain is rarely known. Furthermore,
an image search engine has much stricter latency requirements
than recommendation systems such as News Feed or Friend
Recommendation. Therefore, it is also very important to strike a
balance between the efficiency and effectiveness of ranking algorithms.

We thus address the aforementioned issues from three aspects: 


Data We propose a simple yet effective way to combine the explicit
feedback from user engagements, via a weighted aggregation, into the
ground truth labels of the engagement training data. The engagement
training data and the human-curated relevance judgment data are fed
into our core ranking component in parallel to learn different
ranking functions. Finally, model stacking is performed to combine
the engagement-based ranking model with the relevance-based ranking
model into the final ranking model.

Featurization In order to address the challenge of extracting
reliable text and visual signals from pins, we made advancements in
featurization that range from feature engineering, to word embedding
and visual embedding, to visual relevance signals, to query intent
understanding and user intent understanding. In order to better
utilize the findings on why pinners use Pinterest to search for
images, extensive feature engineering and user studies were performed
to incorporate explicit feedback via different types of engagement
into the ranking features of both Pins and queries. Furthermore, the
learned intent of users and dozens of other user-level features are
utilized in our core machine-learned ranking to provide a
personalized image search experience for pinners.

Modeling We design a cascading core ranking component
to achieve a trade-off between search latency and search
quality. Our cascading core ranking filters the candidates
from millions to thousands using a very lightweight ranking
function and subsequently applies a much more powerful
full ranking over the thousands of pins to achieve a much better
quality. For each stage of the cascading core ranking, we
perform a detailed study on various ranking models and 
empirically analyze which model is “better" than another 
by examining their performances in both query-level and 
user-level quality metrics. 


The remainder of this work is organized as follows. In Section 2, 
we first introduce how we curated training data from our own 
search logs and human evaluation platform. The feature represen- 
tation for users, queries and pins is presented in Section 3. We then 
introduce a set of ranking models that are experimented in different 
stages of the cascading ranking and how we ensemble models built 
from different data sources in Section 4. In Section 5, we present our 
offline and online experimental study to evaluate the performance 
of our core ranking in production. Related work is discussed in 
Section 6. Finally we conclude this work and present future work 
in Section 7. 


2 ENGAGEMENT AND RELEVANCE DATA IN 
PINTEREST SEARCH 


There are several ways to evaluate the quality of search results,
including human relevance judgment and user behavioral metrics
(e.g., click-through rate, repin rate, close-up rate, abandon rate, etc.).
A perfect search system should therefore return results that are both
highly relevant and highly engaging. We thus design and develop
two relatively independent data generation pipelines: an engagement
data pipeline and a human relevance judgment data pipeline. The
two are seamlessly combined into the same learning to rank module.
In the following, we share our practical tricks for obtaining useful
information from engagement and relevance data for the learning module.


2.1 Engagement Data 


Learning from user behavior was first proposed by Joachims [17],
who presented an empirical evaluation of interpreting click-through
evidence. Since then, click-through engagement logs have become the
standard training data for learning to rank optimization in search
engines. Engagement data in the Pinterest search engine can be thought
of as tuples $\langle q, u, (P, \mathcal{T}) \rangle$ consisting of the query q, the user u,
the set P of pins the user engaged with, and the engagement map $\mathcal{T}$
that records the raw engagement counts of each type of action
over the pins P. Note that here the notation user u denotes not only a
single user, but a group of users who share the same user feature
representation.

However, as introduced earlier in Figure 1, when impression pins
are displayed to users, they can perform multiple actions on the
pins, including click-through, repin, close-up, like, hide, comment,
try-it, etc. While different types of actions provide us with multiple
feedback signals from users, they also bring up a new challenge:
how should we simultaneously combine and optimize multiple
feedback signals?

One possible solution is to simply prepare multiple sources
of engagement training data, each of which is fed into the ranking
function to train a specific model optimizing a certain type of
engagement action. For instance, we could train a click-based ranking
model, a repin-based ranking model and a closeup-based ranking model,
respectively. Finally, a calibration over the multiple models is
performed before serving them to obtain the final display.
Unfortunately, we experimented with hundreds of methods for model
ensemble and calibration and were unable to obtain a high-quality
ranking that does not sacrifice any engagement metric.

Thus, instead of calibrating over the models, we integrate multiple
engagement signals at the data level. Let $l(p \mid q, u)$ denote
the engagement-based quality label of pin p for the user u and query
q. To shorten the notation, we simply use $l_p$ to denote $l(p \mid q, u)$
when the given query q and user u can be omitted without ambiguity.
We generate the engagement-based quality label set L of pins
P as follows.

For each pin $p \in P$ with the same keyword query q and user
features u, the raw label $l_p$ is computed as a weighted aggregation
of multiple types of actions over all the users with the same features.
That is,

$$l_p = \sum_{t \in \mathcal{T}} w_t c_t \qquad (1)$$

where $\mathcal{T}$ is the set of engagement actions, $c_t$ is the raw engagement
count of action t and $w_t$ is the weight of a specific action t. The
weight $w_t$ of each type of action is inversely proportional to the
volume of that type of action.
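
As a minimal sketch of the weighted aggregation in Eq. 1, the snippet below derives action weights that are inversely proportional to action volume and applies them to one pin's engagement counts. The action names, the volumes, and the normalization of the weights to sum to one are illustrative assumptions, not values or choices stated in this paper.

```python
def action_weights(action_volumes):
    """Weights inversely proportional to each action's total volume.

    Normalizing the weights to sum to 1 is an assumption made for readability.
    """
    inverse = {a: 1.0 / v for a, v in action_volumes.items() if v > 0}
    total = sum(inverse.values())
    return {a: w / total for a, w in inverse.items()}


def raw_label(counts, weights):
    """Eq. 1: l_p = sum over actions t of w_t * c_t."""
    return sum(weights.get(action, 0.0) * c for action, c in counts.items())


# Hypothetical volumes: rare actions (repin) get larger weights than common ones.
volumes = {"close_up": 1000, "click": 100, "repin": 50}
weights = action_weights(volumes)
label = raw_label({"close_up": 10, "repin": 2}, weights)
```

With these made-up volumes, a single repin contributes twenty times as much to the label as a single close-up, which is the intended effect of weighting by inverse volume.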

We also normalize the raw label of each pin based on its position 
in the current ranking and its age to correct the position bias and 
freshness bias as follows: 


$$l_p = l_p \left( \frac{1}{\log(age_p / \tau) + 1.0} + e^{\lambda \, pos_p} \right) \qquad (2)$$

where $age_p$ and $pos_p$ are the age and position of pin p, $\tau$ is the
normalized weight for the ages of pins, and $\lambda$ is the parameter that
controls the position decay.

Another challenge in generating good-quality engagement training
data is that we always have a huge stream of negative training
samples but very few positive samples that received users'
engagement actions. To avoid over-learning from too many negative
samples, two pruning strategies are applied:


(1) Prune any query group (q, u) and its training tuples
$\langle q, u, (P, \mathcal{T}, L) \rangle$ that do not contain any positive training
samples (i.e., $\forall p \in P, l_p \in L: l_p \le 0$).


Demystifying Core Ranking in Pinterest Image Search 


The Query: {search_query}

The Search Result: [Pin image, with the example description “I'd love to see the great views on Naboo again”]

1. Please rate the relevance of the Pin to: {search_query}.
Make your judgement based primarily on the image, using the description below the image only as
complementary information.

    Very Relevant
    Relevant
    Not relevant

2. Please check all boxes that apply.

    The Pin is missing
    The Pin contains adult or offensive content
    The Pin (image or text below) is in a foreign language

Note: If you're stuck on a certain Pin, please use the training guide to review a few nuanced cases.


Figure 2: Template for rating how relevant a pin is to a query. 


(2) For each query group, randomly prune negative samples if
the number of negative samples is greater than a threshold $\delta$
(i.e., $|\{p \mid p \in P, l_p \le 0\}| > \delta$).


With the above simple yet effective strategies, engagement-based
training data can be automatically extracted from our Pinterest search logs.
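
The two pruning strategies can be sketched as follows. This is a hedged illustration: the data layout (a dictionary from a hypothetical `(query, user_segment)` key to `(pin_id, label)` pairs), the function name and the `max_negatives` parameter are assumptions, and a sample is treated as negative when its label is not positive.

```python
import random


def prune_training_groups(groups, max_negatives, seed=0):
    """Apply the two pruning strategies to engagement training data.

    groups: dict mapping (query, user_segment) -> list of (pin_id, label).
    """
    rng = random.Random(seed)
    pruned = {}
    for key, samples in groups.items():
        positives = [s for s in samples if s[1] > 0]
        negatives = [s for s in samples if s[1] <= 0]
        if not positives:
            continue  # strategy (1): drop groups without any positive sample
        if len(negatives) > max_negatives:
            # strategy (2): randomly downsample the negatives to the threshold
            negatives = rng.sample(negatives, max_negatives)
        pruned[key] = positives + negatives
    return pruned
```

For example, a group with one engaged pin and five unengaged pins keeps three samples when `max_negatives=2`, while a group with no engaged pin is removed entirely.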


2.2 Human Relevance Data 


While the aggregation of large-scale, individually unreliable user
search sessions provides reliable engagement training data with
implicit feedback, it also inherits the biases of the current ranking
function, position bias being one example. To correct the ranking bias,
we also curate relevance judgment data from human experts with
an in-house crowd-sourcing platform. The template for rating how
relevant a Pin is to a query is shown in Figure 2. Note that each
human expert must be a core Pinterest user and pass the golden-set
query quiz before she/he can start relevance judgment on a
three-level scale: very relevant, relevant, not relevant. The raw quality
label $l_p \in [0, 2]$ is thus averaged over the ratings of all the human
experts.


2.3 Combining Engagement with Relevance 


Clearly, the range of the raw quality label $l_p$ of the human relevance
data differs a lot from that of the engagement data. Figure 3 reports
the distribution of quality labels in a set of sampled engagement
data and that of human judgment scores in human relevance data
after downsampling the negative tuples. Even if we normalize both
of them into the same range such as [0, 1], it is still not an
apples-to-apples comparison. Therefore, we simply consider each training
data source independently, feed each of them into the ranking
function to train a specific model, and then perform the model ensemble
described in Section 4.3. This ad-hoc solution performs best in both our
offline and online A/B test evaluations.


3 FEATURE REPRESENTATION FOR 
RANKING 


There are several major groups of features in traditional search 
engines, which, when taken together, comprise thousands of fea- 
tures [6] [12]. Here we restrict our discussion to how we enhance 
traditional ranking features to address unique challenges in Pinter- 
est image search. 
Figure 3: Distribution of quality label $l_p$ across different data
sources


3.1 Beyond Text Relevance Feature 


As discussed earlier, the text description of each Pin is usually very
short and noisy. To address this issue, we build an intensive pipeline
that generates high-quality text annotations for each pin in the form
of unigrams, bigrams and trigrams. The text annotations of a pin
are extracted from different sources such as the title, the description,
texts from the crawled linked web pages, texts extracted from the
image and automatically classified annotation labels. These aggregated
annotations are then used to compute the text matching
score using BM25 [31] and/or proximity BM25 [33].
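
For readers unfamiliar with BM25, the following is a minimal sketch of scoring pin annotations against a query. It uses the common Okapi BM25 form with a non-negative IDF variant and the conventional defaults `k1=1.2`, `b=0.75`; the example annotations and these parameter values are assumptions, and the production pipeline described here (including proximity BM25) is not reproduced.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each document (a list of annotation terms) against the query."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            idf = math.log(1.0 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1.0 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1.0) / norm
        scores.append(score)
    return scores


# Hypothetical annotation sets for three pins.
pins = [
    ["valentines", "nails", "nail art"],
    ["valentines", "card", "diy", "craft"],
    ["china", "travel"],
]
scores = bm25_scores(["valentines", "nails"], pins)
```

The first pin matches both query terms and scores highest, the second matches only the more common term, and the third does not match at all.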

Even with the high quality image annotation, the text signal 
is still much weaker and noisier than that in the traditional web 
page search. Therefore, in addition to word-level relevance mea- 
surement, a set of intent-based and embedding-based similarity 
measurement features are developed to enhance the traditional 
text-based relevancy. 


Categoryboost This type of feature tries to go beyond similar- 
ity at the word level and compute similarity at the category 
level. Note that in Pinterest, we have a very precise human 
curated category taxonomy, which contains 32 L1 categories 
and 500 L2 categories. Both queries and pins were annotated 
with categories and their confidences through our multi-label 
categorizer. 

Topicboost Similar to categoryboost, this type of feature tries 
to go beyond similarity at the word level and compute simi- 
larity at the topic level. However, in contrast to the category, 
each topic here denotes a distribution of words discovered 
by the statistical topic modeling such as Latent Dirichlet 
allocation topic modeling [2]. 

Embedding Features The group of embedding features evalu- 
ates the similarity between users’ query request and the pins 
based on their distances on the learned distributed latent 
representation space. Here both word embedding [24] and 
visual embedding [16] [19] are trained and inferred via differ- 
ent deep neural network architectures on our own Pinterest 
Image Corpora. 


Our enhanced text relevance features play very important roles in
our ranking model. For instance, the categoryboost feature was the
15th most important feature in the organic search ranking model and
was ranked 1st in the search ads relevance ranking model.

3.2 User Intent Features


We derive a set of user-intent based features from the explicit feedback
received from user engagement.


Navboost Navboost is our signal into how well a pin performs 
in general and in context of a specific query and user segment. 
It is based on the projected close up, click, long-click and 
repin propensity estimated from previous user engagement. 
In addition to segmented signal in terms of types of actions, 
we also derive a family of Navboost signals segmented by 
country, gender, aggregation time (e.g., 7 days, 90 days, two 
years etc). 

Tokenboost Similarly, in order to increase the coverage, an- 
other feature Tokenboost is proposed to evaluate how well 
a pin performs in general and in context of a specific token. 

Gender Features Pinterest currently has a majority female 
user base. To ensure we provide equal quality content to 
male users, we developed a family of gender features to 
determine, generally, whether a pin is gender neutral or 
would resonate with men. We then can rank more gender 
neutral or male-specific Pins whenever a male user searches. 
For example, if a man searches shoes, we want to ensure he 
finds shoes for him, not women’s shoes. 

Personalized Features As our mission is to help you discover 
and do what you love, we always put users first and pro- 
vide as much personalization in results as possible. In order 
to do this, we rely on not only the demographic information
of users, but also various intent-based features such as
categories, topics, and embedding of users. 


User intent features are among the most important features for core
ranking, and they help our learning algorithm learn which types
of pins are “really” relevant and interesting to users. For instance,
the Navboost feature is able to tell the ranking function that a pin
about “travel guides to China” is much more attractive than a pin
about “China Map” (which is ranked 1st in Google Image Search)
or “China National Flag” when a user searches for the query “China”
in Pinterest.


3.3 Query Intent Features


Similar to traditional web search, we also utilize common query- 
dependent features such as length, frequency, click-through rate 
of the query. In addition to those common features, we further 
develop a set of Pinterest-specific features such as whether the
query is male-oriented, the ratio between click-through and repin,
and the category and other intents of queries.


3.4 Socialness, Visual and other Features 


In addition to the above features, there exist more unique features
in the Pinterest ecosystem. Since each ranking item is an image, dozens
of visual features are developed, ranging from a simple image
score based on image size and aspect ratio to image hashing features.

Meanwhile, in addition to image search, Pinterest also provides
other social products such as image sharing, friends/pin/board
following, and cascading image feed recommendation. These products
also provide very valuable ranking features such as the socialness,
popularity and freshness of a pin or a user.

Figure 4: An illustrative view of cascading ranking


4 CASCADING RANKING MODELS 


Pinterest Search handles billions of queries every month and helps 
hundreds of millions of monthly active users discover useful ideas 
through high quality Pins. Due to the huge volume of user queries 
and pins, it is critical to provide a ranking solution that is both 
effective and efficient. In this section, we provide a deep-dive
walk-through of our cascading core ranking module.


4.1 Overview of the Cascading Ranking 


As illustrated in Figure 4, we develop a three-stage cascading rank- 
ing module: light-weight stage, full-ranking stage, and re-ranking 
stage. Note that multi-stage ranking was proposed as early as in 
NestedRanker [25] to obtain high accuracy in retrieval. However, 
only recently motivated by the advances of cascading learning in 
traditional classification and detection [29], cascading ranking [20] 
has been re-introduced to improve both the accuracy and the efficiency
of ranking systems. Coincidentally, the Pinterest Image Search
System applies a cascading ranking design similar to that of the
Alibaba commerce search engine [20]. In the light-weight stage,
an efficient model (e.g., a linear model) is applied over a set of
important but cheaply computed features to filter out negative pins
before passing the candidates to the full-ranking stage. As shown in
Figure 4, the light-weight stage filters out millions of pins and
restricts the candidate set for full ranking to the scale of thousands.
In the full-ranking stage, we select a set of more precise and expensive
features, together with a complex model followed by the
model ensemble, to provide a high-quality ranking. Finally, in the
re-ranking stage, several post-processing steps are applied before 
returning results to the user to improve freshness, diversity, locale- 
and language-awareness of results. 
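
The first two stages of the cascade can be sketched as below: a cheap linear scorer prunes the candidate set, and a more expensive scorer orders the survivors. The function and parameter names (`cascade_rank`, `keep`, `top_n`, `full_scorer`) are hypothetical, the re-ranking stage is omitted, and the toy features are made up for illustration.

```python
import numpy as np


def cascade_rank(light_features, full_features, light_weights, full_scorer, keep, top_n):
    """Two-stage cascade: linear light-weight filter, then full ranking.

    Returns the indices of the top_n candidates in final ranked order.
    """
    # Stage 1: linear model over a few cheap features; keep the best `keep`.
    light_scores = light_features @ light_weights
    keep = min(keep, len(light_scores))
    survivors = np.argsort(-light_scores)[:keep]
    # Stage 2: expensive scorer is evaluated only on the survivors.
    full_scores = full_scorer(full_features[survivors])
    order = np.argsort(-full_scores)[:top_n]
    return survivors[order]


light = np.array([[5.0, 0.0], [4.0, 0.0], [3.0, 0.0], [2.0, 0.0], [1.0, 0.0]])
full = np.array([[0.1], [0.9], [0.5], [1.0], [0.7]])
ranked = cascade_rank(light, full, np.array([1.0, 0.0]), lambda f: f[:, 0], keep=3, top_n=2)
```

In this toy example candidate 3 would have received the highest full-ranking score but never reaches that stage, which illustrates the latency versus quality trade-off that the light-weight stage introduces.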

To ease the presentation, we use q, u, p to denote query, user
and pin respectively. x denotes the feature representation for a
tuple with query q, user u and pin p (see Section 3 for more details).
$l(p \mid q, u)$ is the observed quality score of pin p given query q and
user u, usually obtained from either the search log or human
judgment (see Section 2). y is the ground truth quality label of pin
p given query q and user u, which is constructed from the observed
quality score $l(p \mid q, u)$. Similar to $l(p \mid q, u)$, we use $s(p \mid q, u)$
to denote the scoring function that estimates the quality score of
pin p given query q and user u. To shorten the notation, we also
simply use $l_p$ to denote $l(p \mid q, u)$ and $s_p$ to denote $s(p \mid q, u)$ when
the given query q and user u can be omitted without ambiguity. $\mathcal{L}$
denotes the loss function and S denotes the scoring function.

Table 1: A list of models experimented with in different stages of
the cascading core ranking.


Stage         Feature       Model             Is Pairwise?
Light-weight  8 features    Rule-based        -
                            RankSVM [17]      Pairwise
Full          All features  GBDT [18] [34]    Pointwise
                            DNN               Pointwise
                            CNN               Pointwise
                            RankNet [3, 4]    Pairwise
                            RankSVM [17]      Pairwise
                            GBRT [36] [37]    Pairwise
Re-ranking    6 features    Rule-based        -
                            GBDT [18] [34]    Pointwise
                            GBRT [36] [37]    Pairwise
                            RankSVM [17]      Pairwise

4.2 Ranking Models 


As shown in Table 1, we experimented with a list of representative
state-of-the-art models, with our own variations of loss functions and
architectures, in different stages of the cascading core ranking. In
the following, we briefly introduce how we adapt each model to
our ranking framework. We omit the details of the rule-based
model since it is applied very intuitively.

Gradient Boost Decision Tree (GBDT) Given a continuous and
differentiable loss function $\mathcal{L}$, Gradient Boost Machine [11] learns
an additive classifier $H^T = \sum_{t=1}^{T} \eta_t h^t(x)$ that minimizes $\mathcal{L}(H^T)$,
where $\eta_t$ is the learning rate. In the pointwise setting of GBDT, each
$h^t$ is a limited-depth regression tree (also referred to as a weak
learner) added to the current classifier at iteration t. The weak
learner $h^t$ is selected to minimize the loss function $\mathcal{L}(H^{t-1} + \eta_t h^t)$.
We use mean square loss as the training loss for the given training
instances:

$$\mathcal{L}(h^t) = \frac{1}{n} \sum_{i=1}^{n} (y_i - h^t(x_i))^2 \qquad (3)$$

where n is the number of training instances and the ground truth label
y is equal to the observed continuous quality label $l(p \mid q, u)$.
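
To make the additive training loop concrete, here is a minimal sketch of pointwise gradient boosting under the squared loss, where each weak learner is fit to the current residuals (the negative gradient of the squared loss up to a constant factor). The depth-1 trees, the constant learning rate `eta`, and the toy data are simplifying assumptions and do not reflect the tree depth or hyperparameters used in production.

```python
import numpy as np


def fit_stump(x, residual):
    """Fit a depth-1 regression tree (one threshold on one feature)."""
    mean = residual.mean()
    # Fallback: a constant stump (threshold +inf sends every row to the left leaf).
    best = (np.sum((residual - mean) ** 2), 0, np.inf, mean, mean)
    for f in range(x.shape[1]):
        for thr in np.unique(x[:, f])[:-1]:
            left = x[:, f] <= thr
            lv, rv = residual[left].mean(), residual[~left].mean()
            err = np.sum((residual - np.where(left, lv, rv)) ** 2)
            if err < best[0]:
                best = (err, f, thr, lv, rv)
    return best[1:]


def predict_stump(stump, x):
    f, thr, lv, rv = stump
    return np.where(x[:, f] <= thr, lv, rv)


def fit_gbdt(x, y, n_trees=50, eta=0.1):
    """Additive model H = base + eta * sum_t h^t, each h^t fit to the residuals."""
    base = float(y.mean())
    pred = np.full(len(y), base)
    stumps = []
    for _ in range(n_trees):
        stump = fit_stump(x, y - pred)
        pred = pred + eta * predict_stump(stump, x)
        stumps.append(stump)
    return base, eta, stumps


def predict_gbdt(model, x):
    base, eta, stumps = model
    pred = np.full(x.shape[0], base, dtype=float)
    for stump in stumps:
        pred += eta * predict_stump(stump, x)
    return pred
```

On a toy set such as `x = [[0], [1], [2], [3]]` with labels `[0, 0, 1, 1]`, each added stump shrinks the residuals by a factor of `1 - eta`, so the training error decreases monotonically with the number of trees.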

Deep Neural Network (DNN) The conceptual architecture of the
DNN model is illustrated in Figure 5(a). This architecture is
a pointwise ranking model that learns to predict the quality score
$s(p \mid q, u)$.

Instead of directly learning a scoring function $S(q, u, p \mid \theta)$ that
determines the quality score of pin p for query q and user u given
a set of model parameters $\theta$ [8], we transform the problem into
a multi-class classification problem that classifies each pin into a
4-scale label {1, 2, 3, 4}. Specifically, during the training phase, we
discretize the continuous quality label $l(p \mid q, u)$ into the ordinal
label $y \in \{1, 2, 3, 4\}$ and train a multi-class classifier $S(k \mid q, u, p, \theta)$
that predicts the probability of pin p belonging to class k.

As shown in Figure 5(a), we use cross entropy loss as the training
loss for a single training instance:

$$\mathcal{L}(S, y) = -\sum_{k=1}^{K} \mathbb{1}\{y = k\} \log S(k \mid q, u, p, \theta) \qquad (4)$$

where K is the number of class labels (K = 4 in this setting).


Conference’17, July 2017, Washington, DC, USA 


(a) Simple neural network: User, Query and Pin inputs → Multi-hot Encoding → Hidden Layer → Hidden Layer → Output Layer.
(b) Convolutional neural network: User, Query and Pin inputs → Conv Layer → Max Pooling → FC Layer → Output Layer.


Figure 5: Different ranking architectures 


In the inference phase, we treat the trained model as a pointwise
scoring function to score each pin p for query q and user u using
the following conversion function:

$$s(p \mid q, u) = \sum_{k} k \cdot S(k \mid q, u, p, \theta) \qquad (5)$$

Convolutional Neural Network (CNN) In this model, similar to
the previous DNN model, the goal is to learn a multi-class classifier
$S(k \mid q, u, p, \theta)$ and then convert the predicted probabilities
into a scoring function $s(p \mid q, u)$ using Eq. 5. As
depicted in Figure 5(b), the architecture contains a first
convolutional layer, followed by a max pooling layer with the ReLU
activation, a second convolutional layer, again followed by a
max pooling layer and the ReLU activation, a fully connected layer
and the output layer.

Despite the differences in the architecture, the CNN model uses
the same problem formulation, cross entropy loss function, and
score conversion function (Eq. 5) as the DNN.

RankNet Burges et al. [3] proposed to learn ranking using a
probabilistic cost function based on pairs of examples. Intuitively, the
pairwise model tries to learn the correct ordering of pairs of
documents in the ranked lists of individual queries. In our setting, the
model learns a ranking function $S(q, u, p_i, p_j, \theta)$ which predicts the
probability that pin $p_i$ is ranked higher than $p_j$ given query q and
user u.

Therefore, in the training phase, one important task is to extract
the preference pair set P given query q and user u. In RankNet,
the preference pair set was extracted from the pairs of consecutive
training samples in the ranked lists of individual queries. When
applying RankNet to Pinterest search ranking, the preference
pair set is constructed based on the raw quality label $l(p \mid q, u)$. For
instance, $p_i$ is preferred over $p_j$ if $l(p_i \mid q, u) > l(p_j \mid q, u)$. Note that
the same preference pair set construction is applied to all the following
pairwise models.
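
A minimal sketch of the label-based preference pair construction is given below. The function name and the dictionary layout are assumptions; the sketch enumerates all pairs within one (query, user) group and skips ties, which is one straightforward reading of the rule above.

```python
from itertools import combinations


def preference_pairs(labels):
    """Build (preferred, other) pin pairs for one (query, user) group.

    labels: dict mapping pin_id -> raw quality label l(p | q, u).
    Pins with equal labels yield no pair.
    """
    pairs = []
    for (pin_i, l_i), (pin_j, l_j) in combinations(labels.items(), 2):
        if l_i > l_j:
            pairs.append((pin_i, pin_j))
        elif l_j > l_i:
            pairs.append((pin_j, pin_i))
    return pairs


pairs = preference_pairs({"a": 2.0, "b": 0.0, "c": 2.0})
```

Here pins `a` and `c` are each preferred over `b`, while the tie between `a` and `c` contributes nothing to training.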

Given a preference pair $(p_i, p_j)$, Burges et al. [3] used the cross
entropy as the loss function in RankNet:

$$\mathcal{L}(S, y) = -y_{ij} \log S(q, u, p_i, p_j, \theta) - (1 - y_{ij}) \log\left(1 - S(q, u, p_i, p_j, \theta)\right) \qquad (6)$$

where $y_{ij}$ is the ground truth probability of pin $p_i$ being ranked higher
than $p_j$.

The model was named RankNet since Burges et al. [3] used
a two-layer neural network to optimize the loss function in Eq. 6.
The very recent ranking model proposed by Dehghani et al. [8] can be
considered a variant of RankNet, which uses a hinge loss function
and a different way of converting the pairwise ranking probability
into a scoring function.
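
The pairwise cross entropy of Eq. 6 can be sketched as follows. As in the original RankNet formulation, the pair probability is modeled as a logistic function of the difference between the two pins' scores; the scores themselves are placeholders for the outputs of whatever network is being trained, and numerical safeguards for extreme score differences are left out.

```python
import math


def pair_probability(score_i, score_j):
    """Probability that pin i should rank above pin j: sigmoid(s_i - s_j)."""
    return 1.0 / (1.0 + math.exp(-(score_i - score_j)))


def ranknet_loss(score_i, score_j, y_ij):
    """Eq. 6: cross entropy between the target y_ij and the predicted pair probability."""
    p = pair_probability(score_i, score_j)
    return -y_ij * math.log(p) - (1.0 - y_ij) * math.log(1.0 - p)
```

When both pins receive the same score the loss equals log 2 regardless of the target, and it shrinks as the preferred pin's score moves above the other's.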

RankSVM In the pairwise setting of RankSVM, given the preference
pair set P, RankSVM [17] aims to optimize the following problem:

$$\arg\min_{w} \; \frac{1}{2} \|w\|^2 + C \sum_{i} \sum_{(j,k) \in P_i} \mathcal{L}(w^{\top} x_j - w^{\top} x_k) \qquad (7)$$

A popular loss function used in practice is the quadratically smoothed
hinge loss [35], $\mathcal{L}(\epsilon) = \max(0, 1 - \epsilon)^2$.
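
The RankSVM objective with the quadratically smoothed hinge loss can be evaluated as in the sketch below. It assumes that each pair `(j, k)` means instance j is preferred over instance k, and it flattens the per-query pair sets into one list; the regularization constant `c` and the toy inputs are illustrative.

```python
import numpy as np


def ranksvm_objective(w, x, pairs, c=1.0):
    """0.5 * ||w||^2 + c * sum over (j, k) of max(0, 1 - (w.x_j - w.x_k))^2.

    x: array of feature vectors; pairs: list of (j, k) with j preferred over k.
    """
    w = np.asarray(w, dtype=float)
    margins = np.array([w @ (x[j] - x[k]) for j, k in pairs])
    hinge = np.maximum(0.0, 1.0 - margins) ** 2
    return 0.5 * float(w @ w) + c * float(hinge.sum())


x = np.array([[1.0, 0.0], [0.0, 0.0]])
loss_zero_w = ranksvm_objective([0.0, 0.0], x, [(0, 1)])
loss_good_w = ranksvm_objective([2.0, 0.0], x, [(0, 1)])
```

With a zero weight vector the only cost is the unit hinge penalty on the unsatisfied pair; with `w = [2, 0]` the pair is separated by a margin of 2, so only the regularizer remains.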


Gradient Boost Ranking Tree (GBRT) Intuitively, one can view
GBRT as a combination of RankSVM and GBDT. In the pairwise
setting of GBRT, similar to RankSVM, at each iteration the model
aims to learn a ranking function $S(q, u, p_i, p_j, \theta)$ that predicts the
probability that pin $p_i$ is ranked higher than $p_j$ given query q and
user u. In addition, similar to the setting of GBDT, here the ranking
function is a limited-depth regression tree $h^t$. Again, the decision
tree $h^t$ is selected to minimize the loss $\mathcal{L}(H^{t-1} + \eta_t h^t)$, where the
loss function is defined as:

$$\mathcal{L}(h^t) = \sum_{i} \sum_{(j,k) \in P_i} \max(0, h^t(x_k) - h^t(x_j) + \epsilon)^2 \qquad (8)$$


4.3 Model Ensemble across Different Data
Sources


In this section, we discuss how we perform calibration over multiple 
models that are trained from different data sources (e.g., engage- 
ment training data versus human relevance data). 

Various ensemble techniques [9] are proposed to decrease vari- 
ance and bias and improve predictive accuracy such as stacking, 
cascading, bagging and boosting (GBDT in Section 4.2 is a popular 
boosting method). Note that the goal here is not only to improve the 
quality of ranking using multiple data sources, but also to maintain 
the low latency of the entire core ranking system. Therefore, we
consider here a specific type of ensemble approach, stacking, which
has relatively low computational costs.

Stacking first trains several models from different data sources 
and the final prediction is the linear combination of these models. 
It introduces a meta-level and uses another model or approach to 
estimate the weight of each model, i.e., to determine which model 
performs well given these input data. 

Note that stacking can be performed either within the training of
each individual model or after the training of each individual model.
When stacking is applied after training each individual model,
the final scoring function is defined as

$$s(p \mid q, u) = \gamma \, s_e(p \mid q, u) + (1 - \gamma) \, s_r(p \mid q, u) \qquad (9)$$

where $s_e$ / $s_r$ is the predicted score of the model trained on engagement / human
relevance judgment data and $\gamma$ is the combination coefficient.
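
Post-training stacking as in Eq. 9 amounts to a convex combination of the two models' scores. The sketch below also shows one simple way to choose the coefficient, a grid search against a held-out metric; that selection procedure, the function names and the `metric` callable are assumptions for illustration, not the meta-level approach used in this work.

```python
import numpy as np


def stack_scores(engagement_scores, relevance_scores, gamma):
    """Eq. 9: gamma * s_e + (1 - gamma) * s_r, elementwise over pins."""
    s_e = np.asarray(engagement_scores, dtype=float)
    s_r = np.asarray(relevance_scores, dtype=float)
    return gamma * s_e + (1.0 - gamma) * s_r


def pick_gamma(engagement_scores, relevance_scores, metric, grid=None):
    """Hypothetical selection: the gamma on a grid that maximizes metric(stacked scores)."""
    if grid is None:
        grid = np.linspace(0.0, 1.0, 11)
    return float(max(grid, key=lambda g: metric(stack_scores(engagement_scores, relevance_scores, g))))


blended = stack_scores([1.0, 0.0], [0.0, 1.0], 0.25)
```

In practice `metric` could be an offline measure such as NDCG computed on validation queries; because only a weighted sum is added at serving time, this form of stacking adds very little latency.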

Stacking can also be performed within model training. For
instance, Zheng et al. [37] linearly combined the tree model that fits
the engagement data and another tree model that fits the human
judgment data using the following loss function:

$$\mathcal{L}(h^t) = \gamma \sum_{i} \sum_{(j,k) \in P_i} \max(0, h^t(x_k) - h^t(x_j) + \epsilon)^2 + (1 - \gamma) \sum_{i} (y_i - h^t(x_i))^2 \qquad (10)$$

where $y_i$ is the relevance label for pin i and $\gamma$ controls the
contribution of each data source.

Here we choose when to perform stacking based on
the complexity of each individual model: stacking is performed in
the model training phase if each individual model is relatively cheap
to compute, and after training each individual model
otherwise (e.g., when each individual model is a neural network).

Note that differs from Eq. 10, we always use the same loss func- 
tion for different data sources. For instance, assume that we aim 
to train GBRT tree models from both engagement training data and 
human relevance data, we simply optimize the combined pairwise 
loss function: 


$$\mathcal{L}(h^t) = \gamma \sum_i \sum_{j,k \in P_i} \max\bigl(0,\, h^t(x_k) - h^t(x_j) + \delta\bigr)^2 + (1 - \gamma) \sum_n \sum_{j,k \in P_n} \max\bigl(0,\, h^t(x_k) - h^t(x_j) + \delta\bigr)^2 \tag{11}$$

where each $P_i$ / $P_n$ denotes a preference set extracted from engagement / 
human judgment data, respectively, and $\gamma$ again controls the 
contribution of each data source. An advantage of this loss function 
is that $\gamma$ can also be intuitively interpreted as proportional to 
the number of trees grown from each data source. 
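
The combined objective of Eq. 11 can be sketched as follows. This is an illustration under stated assumptions rather than our training code: we assume each preference pair (x_j, x_k) means that pin j should rank above pin k, and the names `pairwise_loss`, `combined_loss`, the toy scorer and the margin value are hypothetical.

```python
from typing import Callable, Sequence, Tuple

Features = Sequence[float]
# Assumed convention: in a pair (x_j, x_k), x_j is preferred over x_k.
Pair = Tuple[Features, Features]


def pairwise_loss(h: Callable[[Features], float],
                  pref_sets: Sequence[Sequence[Pair]],
                  delta: float) -> float:
    """Sum of squared hinge losses over all preference sets."""
    total = 0.0
    for pref_set in pref_sets:
        for x_j, x_k in pref_set:
            # Zero loss only if the preferred pin wins by at least delta.
            total += max(0.0, h(x_k) - h(x_j) + delta) ** 2
    return total


def combined_loss(h: Callable[[Features], float],
                  engagement_prefs: Sequence[Sequence[Pair]],
                  relevance_prefs: Sequence[Sequence[Pair]],
                  coef: float,
                  delta: float = 1.0) -> float:
    """Same pairwise loss on both data sources, weighted by coef."""
    return (coef * pairwise_loss(h, engagement_prefs, delta)
            + (1.0 - coef) * pairwise_loss(h, relevance_prefs, delta))


def toy_scorer(x: Features) -> float:
    # Stand-in for the current ensemble: the score is the sum of features.
    return sum(x)


engagement = [[([1.0], [0.0])]]  # correctly ordered with margin 1: no loss
relevance = [[([0.0], [1.0])]]   # wrongly ordered: (1 - 0 + 1)^2 = 4
loss = combined_loss(toy_scorer, engagement, relevance, coef=0.25)
```

Because both terms share the same form, changing `coef` only shifts how much each data source contributes to the objective.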


5 EXPERIMENT 
5.1 Offline Experimental Setting 


The first group of experiments was conducted offline on the training 
data extracted as described in Section 2. For each country and 
language, we curated 5000 queries and collected human judgments 
for 400 pins per query. In addition, we built the engagement training 
data pipeline by randomly sampling 1% of the search user session 
logs from the most recent 7 days. The full data set was randomly 
divided: 70% was used for training, 20% for testing, and 10% for 
validation. In total, we have 15 million training instances. 


5.1.1 Feature Statistics. We also analyzed the coverage and 
distribution of each individual feature. Due to space limitations, we 
report the statistics of the most important features from each group 
in Figure 6. 


5.1.2 Offline Measurement Metrics. In the offline setting, we use the 
query-level Normalized Discounted Cumulative Gain (NDCG [14]). 
Given a list of documents and their ground truth labels $l$, the 
discounted cumulative gain at position $p$ is defined as: 

$$\mathrm{DCG}_p = \sum_{i=1}^{p} \frac{l_i}{\log_2(i + 1)} \tag{12}$$

The NDCG is thus defined as: 

$$\mathrm{NDCG}_p = \frac{\mathrm{DCG}_p}{\mathrm{IDCG}_p} \tag{13}$$

where $\mathrm{IDCG}_p$ is the ideal discounted cumulative gain. 
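
For concreteness, a minimal sketch of the query-level NDCG computation is given below. It assumes the linear-gain form of DCG, in which the label at position i is divided by log2(i + 1); other gain functions are common, and the function names are illustrative.

```python
import math
from typing import Sequence


def dcg_at(labels: Sequence[float], p: int) -> float:
    """DCG at position p, assuming linear gain and a log2 position discount."""
    return sum(
        label / math.log2(i + 1)
        for i, label in enumerate(labels[:p], start=1)
    )


def ndcg_at(labels: Sequence[float], p: int) -> float:
    """NDCG at p: DCG of the ranked list divided by DCG of the ideal order."""
    ideal = dcg_at(sorted(labels, reverse=True), p)
    # A query whose labels are all zero has no achievable gain.
    return dcg_at(labels, p) / ideal if ideal > 0 else 0.0
```

Here `labels` holds the ground truth labels in the order produced by the ranker, so a list that is already sorted by decreasing label obtains an NDCG of 1.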

Since we have two different data sources, we derived two 
measurement metrics: $\mathrm{NDCG}_r$ for the human relevance data and $\mathrm{NDCG}_e$ 
for the engagement data. 


5.2 Online Experimental Setting 


A standard A/B test is conducted online, where users are bucketed 
into 100 different buckets and both the control group and the enabled 
group can use as many as 50 buckets. In this experiment, the control 
group (5% of users) used the old production ranking 
model, while the enabled group (another 5% of users) used the 
experimental ranking model. 
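
The bucketing step can be sketched as follows. The text does not specify how users are mapped to buckets, so the hash-based assignment, the experiment name, and the particular bucket ranges below are assumptions made purely for illustration.

```python
import hashlib

NUM_BUCKETS = 100
# Assumed allocation: 5 of the 100 buckets (about 5% of users) per group.
CONTROL_BUCKETS = range(0, 5)
ENABLED_BUCKETS = range(50, 55)


def bucket_of(user_id: str, experiment: str) -> int:
    """Deterministically map a user to one of NUM_BUCKETS buckets."""
    # Salting with the experiment name decorrelates different experiments.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_BUCKETS


def group_of(user_id: str, experiment: str) -> str:
    """Return the experiment group a user falls into."""
    bucket = bucket_of(user_id, experiment)
    if bucket in CONTROL_BUCKETS:
        return "control"
    if bucket in ENABLED_BUCKETS:
        return "enabled"
    return "not_in_experiment"
```

A deterministic assignment keeps each user in the same group for the whole experiment, which is what makes user-level metrics comparable between the two groups.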

The Pinterest image search engine handles on average 2 billion 
monthly text searches, 600 million monthly visual searches, and 70 
million queries every day, and the query volume can double 
during peak periods such as Valentine's Day and Halloween. 
Therefore, roughly 7 million queries per day and their search 
results were evaluated in our online experiments. 


5.2.1 Online Measurement Metrics. In the online setting, we use 
both user-level and query-level measurement metrics. For the 
query-level metrics, repin per search ($Q_{\mathrm{repin}}$), click per search 
($Q_{\mathrm{click}}$), close up per search ($Q_{\mathrm{close\,up}}$) and engagement per 
search ($Q_{\mathrm{engaged}}$) were the main metrics we used. This is because 
repin, click and close up are the three main types among the 60 types 
of actions in total. The volume of close up actions (a user clicks on 
a pin to see the zoomed-in image and the description of the pin) is 
dominant since this action is the cheapest. In contrast, the volume of 
click actions is much lower because a click is more expensive to 
perform (as shown in Figure 1, a click means that a user clicked the 
hyperlink of a pin and went to the externally linked web page after a 
close up action). 

Figure 7: Relative performance of RankSVM model to the 
baseline rule-based method in lightweight ranking stage. 


Table 2: Latency Improvement of RankSVM Lightweight Ranking 

Latency     | Rule-based | RankSVM 
< 50 ms     | 5%         | 8% 
50 - 200 ms | 43%        | 61% 
> 200 ms    | 52%        | 31% 


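At the user level, we use the following measurement metrics: 

$$\begin{aligned}
U_{\mathrm{repin}} &= \frac{\#\text{ of repinned users}}{\#\text{ of searchers}}, &
U_{\mathrm{close\,up}} &= \frac{\#\text{ of close up users}}{\#\text{ of searchers}}, \\
U_{\mathrm{click}} &= \frac{\#\text{ of clicked users}}{\#\text{ of searchers}}, &
U_{\mathrm{engaged}} &= \frac{\#\text{ of engaged users}}{\#\text{ of searchers}}
\end{aligned} \tag{14}$$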


To evaluate the effect of re-ranking on boosting local and fresh 
content, we also use the following measurement metrics: 

$$\begin{aligned}
L_{\mathrm{imp}} &= \frac{\#\text{ of local impressed pins}}{\#\text{ of impressed pins}}, &
F_{\mathrm{imp}} &= \frac{\#\text{ of fresh impressed pins}}{\#\text{ of impressed pins}}, \\
L_{\mathrm{repin}} &= \frac{\#\text{ of local pins repinned}}{\#\text{ of local pins}}, &
F_{\mathrm{repin}} &= \frac{\#\text{ of fresh pins repinned}}{\#\text{ of fresh pins}}, \\
L_{\mathrm{click}} &= \frac{\#\text{ of local pins clicked}}{\#\text{ of local pins}}, &
F_{\mathrm{click}} &= \frac{\#\text{ of fresh pins clicked}}{\#\text{ of fresh pins}}
\end{aligned} \tag{15}$$

where local pins are pins whose linked country matches that of the 
user, and fresh pins are pins that are no older than 
30 days. 
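
A minimal sketch of how these local and fresh content metrics could be computed from logged impressions follows. The `Impression` record and its field names are hypothetical, and we assume that the number of local (fresh) pins in the denominators refers to local (fresh) impressed pins.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class Impression:
    """One pin shown in a search result (all field names are hypothetical)."""
    is_local: bool   # pin's linked country matches the searcher's country
    is_fresh: bool   # pin is no older than 30 days
    repinned: bool
    clicked: bool


def _ratio(numerator: int, denominator: int) -> float:
    # Guard against empty denominators in small samples.
    return numerator / denominator if denominator else 0.0


def content_metrics(impressions: Iterable[Impression]) -> Dict[str, float]:
    """Impression share, repin rate and click rate for local and fresh pins."""
    imps = list(impressions)
    out: Dict[str, float] = {}
    for name, flag in (("local", "is_local"), ("fresh", "is_fresh")):
        subset = [imp for imp in imps if getattr(imp, flag)]
        out[f"{name}_impression"] = _ratio(len(subset), len(imps))
        out[f"{name}_repin"] = _ratio(sum(imp.repinned for imp in subset), len(subset))
        out[f"{name}_click"] = _ratio(sum(imp.clicked for imp in subset), len(subset))
    return out


sample = [
    Impression(is_local=True, is_fresh=True, repinned=True, clicked=False),
    Impression(is_local=False, is_fresh=True, repinned=False, clicked=True),
    Impression(is_local=True, is_fresh=False, repinned=False, clicked=False),
    Impression(is_local=False, is_fresh=False, repinned=False, clicked=False),
]
metrics = content_metrics(sample)
```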


5.3 Performance Results 


5.3.1 Lightweight Ranking Comparison. The performance of the 
RankSVM model relative to our earlier rule-based ranking model in the 
lightweight ranking stage is summarized in Figure 7. On the offline test 
data set, the RankSVM model obtained consistent improvement over 
the rule-based ranking model. However, when moving to the online 
A/B test, the improvement is smaller. This pattern is consistent 
across all of our ranking experiments: it is much easier to tune a 
model that beats the baseline offline than online. 

Although the quality improvement is relatively subtle, we greatly 
reduced the search latency when migrating from the rule-based ranking 
to the RankSVM model. With the RankSVM model in the lightweight 
stage, we have higher confidence in filtering out negative pins before 
passing the candidates into the full ranking stage, which in turn 
improves the latency. As shown in Table 2, the percentage of searches 
with latency below 50 ms increased from 5% to 8%, while 
the percentage of searches with latency above 200 ms dropped 
from 52% to 31%. 

Conference’17, July 2017, Washington, DC, USA 

Figure 8: Relative performance of different models to the 
baseline RankSVM method in full ranking stage. (a) Offline 
performance. (b) Online performance. [Bar charts; y-axis: relative 
performance to RankSVM; legend: GBDT, GBRT, DNN, CNN, RankNet.] 

The results reported in Figure 7 and Table 2 illustrate 
how we balance search latency and search 
quality with the lightweight ranking model. The RankSVM model 
for the lightweight stage was first launched in April 2017 and has 
been serving all US traffic since then. 


5.3.2 Full Ranking Comparison. In the full ranking stage, we 
conducted detailed offline experiments to compare the performance 
of different models. As shown in Figure 8(a), for the engagement-based 
quality, overall, CNN > GBRT > DNN > RankNet > GBDT, where 
A > B denotes that A performs significantly better than B. In terms of 
relevance-based quality, CNN > {GBRT, DNN, RankNet, GBDT}. 

Although neural ranking models perform very well offline, 
our current online model serving platform for neural ranking 
models incurs additional latency. This latency might be negligible 
for recommendation-based products but causes a bad experience 
for searchers, e.g., increased pinner waiting time. Therefore, 
we compute the ranking scores of the DNN and CNN models 
offline and feed them as two additional features into the online tree 
models, denoted as $\mathrm{GBRT}_{NN}$ and $\mathrm{GBDT}_{NN}$ respectively. The results 
of the online experiment are presented in Figure 8(b). Based on the 
significant improvement of GBRT over the old linear RankSVM model, 
we launched the GBRT model in production in October 2017 and will 
launch the $\mathrm{GBRT}_{NN}$ model to serve the entire search traffic soon. 
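
The feature augmentation described above can be sketched as follows. The (query, pin) lookup key, the function name `add_nn_features`, and the default used for missing scores are assumptions for illustration; the actual offline score storage is not described here.

```python
from typing import Dict, List, Tuple

Key = Tuple[str, str]  # hypothetical lookup key: (query, pin_id)


def add_nn_features(features: List[float],
                    key: Key,
                    dnn_scores: Dict[Key, float],
                    cnn_scores: Dict[Key, float],
                    missing: float = 0.0) -> List[float]:
    """Append precomputed DNN and CNN scores as two extra features.

    Scores are looked up rather than computed, so no neural network
    inference happens at serving time.
    """
    return features + [dnn_scores.get(key, missing), cnn_scores.get(key, missing)]


# Hypothetical offline scores for one (query, pin) pair.
dnn = {("red shoes", "pin_1"): 0.82}
cnn = {("red shoes", "pin_1"): 0.77}
row = add_nn_features([0.3, 1.0], ("red shoes", "pin_1"), dnn, cnn)
```

The tree model then consumes `row` like any other feature vector, which keeps the online cost at that of tree evaluation plus two lookups.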


5.3.3 Re-ranking Comparison. Note that the main purpose of 
re-ranking is to improve the freshness and localness of results. 
In the early days, our re-ranking applied very simple hand-tuned 
rule-based ranking functions. For example, assuming that users prefer 
to see more fresh content, we simply gave any pin younger than 
30 days a boost or enforced that at least a certain percentage 
of the returned results were fresh. 
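
A rule of this kind can be sketched in a few lines. The multiplicative form of the boost and its value are assumptions for illustration; only the 30-day freshness threshold comes from the text.

```python
from typing import List, Tuple

FRESH_AGE_DAYS = 30  # freshness threshold from the text


def boost_fresh(results: List[Tuple[str, float, int]],
                boost: float) -> List[Tuple[str, float]]:
    """Re-rank (pin_id, score, age_days) triples, boosting fresh pins.

    Pins younger than FRESH_AGE_DAYS have their score multiplied by
    `boost` (an assumed, hand-tuned constant); other scores are unchanged.
    """
    rescored = [
        (pin_id, score * boost if age_days < FRESH_AGE_DAYS else score)
        for pin_id, score, age_days in results
    ]
    return sorted(rescored, key=lambda item: item[1], reverse=True)


# Hypothetical results: an old pin with a slightly higher base score.
results = [("old_pin", 1.0, 400), ("new_pin", 0.9, 3)]
reranked = boost_fresh(results, boost=1.2)
```

With a boost of 1.2, the fresh pin overtakes the older one even though its base score is lower, which shows how sensitive such hand-tuned rules are to a single constant.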

We spent much effort on feature engineering and migrated the 
rule-based ranking to machine-learned ranking. After multiple 
iterations of experiments, as shown in Figure 9, we obtained 
query-level and user-level performance comparable to 
the rule-based methods and significantly outperformed the 
rule-based methods in terms of freshness and localness metrics. The 
click-through rate and repin rate on fresh pins increased by 20% 
when replacing the rule-based re-ranker with the GBRT model. 

Figure 9: Relative performance of different models to the 
baseline rule-based method in re-ranking stage. 


6 RELATED WORKS 


Over the past decades, various ranking methods [3-6, 8, 12, 17, 21, 
34, 37] have been proposed to improve the search relevance of 
web pages and/or user engagement in traditional search engines 
and e-commerce search engines. While we refer readers to several 
tutorials [4, 21] for a more detailed introduction to the area of 
learning to rank, here we focus on how the applications of 
learning to rank for image search engines in industry have evolved over 
time. 

Prasad et al. [28] developed the first microcomputer-based image 
database retrieval system. After the successful launch of the Google 
Image Search product in 2001, various image retrieval systems 
were deployed for public use. Earlier works on image retrieval 
systems [7] focused on candidate retrieval with image indexing 
techniques. 

In recent years, many works have been proposed to improve 
the ranking of image search results using visual features and 
personalized features. For instance, Jing et al. [15] proposed the 
VisualRank algorithm, which ranks Google image search results 
based on their centrality in a visual similarity graph. On the other 
hand, how to leverage user feedback and personalized signals for 
image ranking was studied on the Yahoo Image Corpora [27], 
Flickr Image Corpora [10] and Pinterest Image Corpora [23]. In 
parallel to industry applications, Bayesian personalized 
ranking [30] has been studied to improve image search from 
implicit user feedback. 

In addition to general image search products, many recent 
applications have also focused on specific domains such as fashion¹, 
food², home decoration³, etc. This trend also motivates researchers 
to focus on domain-specific image retrieval systems [1, 13, 22]. At 
Pinterest, while we have focused on four verticals (fashion, food, 
beauty and home decoration), we also aim to help people discover 
the things they love in any domain. 


¹ https://www.shopstyle.com/ 
² www.supercook.com/ 
³ https://www.houzz.com/ 

7 CONCLUSION AND FUTURE WORKS 


We introduced how we leverage user feedback in both training 
data and featurization to improve our cascading core ranking for 
the Pinterest Image Search Engine. We empirically and theoretically 
analyzed various ranking models to understand how each of them 
performs in our image search engine. We hope the practical 
lessons learned from our ranking module design and deployment 
can also benefit other image search engines. 

In the future, we plan to focus on two directions. First, as we have 
already observed good performance from both the DNN and CNN ranking 
models, we plan to launch and serve them online directly instead 
of feeding their predicted scores as new features into tree-based 
ranking models. Second, many of our embedding-based features, 
such as word embeddings, visual embeddings and user embeddings, 
were trained and shared across all the products at Pinterest, such as 
home feed recommendation, advertisement, shopping, etc. We plan 
to obtain search-specific embedding features to understand the 
“intents” under the search scenario. 